Theoretical developments on cross validation (CV) have mainly 
focused on selecting one among a list of finite-dimensional models 
(e.g., subset or order selection in linear regression) or selecting a 
smoothing parameter (e.g., bandwidth for kernel smoothing). How- 
ever, little is known about consistency of cross validation when ap- 
plied to compare between parametric and nonparametric methods or 
within nonparametric methods. We show that under some conditions, 
with an appropriate choice of data splitting ratio, cross validation is 
consistent in the sense of selecting the better procedure with proba- 
bility approaching 1. 

Our results reveal interesting behavior of cross validation. When 
comparing two models (procedures) converging at the same nonpara- 
metric rate, in contrast to the parametric case, it turns out that 
the proportion of data used for evaluation in CV does not need to 
be dominating in size. Furthermore, it can even be of a smaller order 
than the proportion for estimation while not affecting the consistency 
property. 

1. Introduction. Cross validation (e.g., Allen [2], Stone [25] and Geisser 
[9]) is one of the most commonly used model selection criteria. Basically, 
based on a data splitting, part of the data is used for fitting each competing 
model (or procedure) and the rest of the data is used to measure the per- 
formance of the models, and the model with the best overall performance is 
selected. There are a few different versions of cross-validation (CV) methods,
including delete-1 CV, delete-$k$ ($k > 1$) CV and also generalized CV
methods (e.g., Craven and Wahba [6]).

Cross validation can be applied to various settings, including parametric 
and nonparametric regression. There can be different primary goals when 
applying a CV method: one mainly for identifying the best model/procedure 
among the candidates and another mainly for estimating the mean function 
or for prediction (see, e.g., Geisser [9]). A number of theoretical results have 
been obtained, mostly in the areas of linear regression and in smoothing 
parameter selection for nonparametric regression. In linear regression, it has 
been shown that delete-1 and generalized CVs are asymptotically equivalent
to the Akaike Information Criterion (AIC) [1] and they are all inconsistent in
the sense that the probability of selecting the true model does not converge
to 1 as $n$ goes to $\infty$ (see Li [15]). In addition, interestingly, the analysis of
Shao [19] showed that in order for delete-$k$ CV to be consistent, $k$ needs to be
dominatingly large in the sense that $k/n \to 1$ (and $n - k \to \infty$). Zhang [35]
proved that delete-$k$ CV is asymptotically equivalent to the Final Prediction
Error (FPE) criterion when $k \to \infty$. The readers are referred to Shao [20] for
more asymptotic results and references on model selection for linear regression.
In the context of nonparametric regression, delete-1 CV for smoothing
parameter selection leads to consistent regression estimators (e.g., Wong [32] 
for kernel regression and Li [14] for the nearest-neighbor method) and leads 
to asymptotically optimal or rate-optimal choice of smoothing parameters 
and/or optimal regression estimation (see, e.g., Speckman [22] and Burman 
[5] for spline estimation, Härdle, Hall and Marron [12], Hall and Johnstone
[11] and references therein for kernel estimation). Györfi et al. [10] gave risk
bounds for kernel and nearest-neighbor regression with bandwidth or neigh- 
bor size selected by delete-1 CV. See Opsomer, Wang and Yang [17] for a 
review and references related to the use of CV for bandwidth selection for 
nonparametric regression with dependent errors. 

In real-world applications of regression, in pursuing a better estimation
accuracy, one may naturally consider the use of cross validation to choose 
between a parametric estimator and a nonparametric estimator (or at least 
to understand their relative performance). Similarly, when different types 
of nonparametric estimators are entertained as plausible candidates, cross 
validation is also applicable to choose one of them. Recently, a general CV 
methodology has been advocated by van der Laan, Dudoit, van der Vaart 
and their co-authors (e.g., van der Laan and Dudoit [26], van der Laan, 
Dudoit and van der Vaart [27] and van der Vaart, Dudoit and van der Laan 
[28]), which can be applied in other contexts (e.g., survival function esti- 
mation). Risk bounds for estimating the target function were derived and 
their implications on adaptive estimation and asymptotic optimality were 
obtained. When CV is used for the complementary purpose of identifying 
the best candidate, however, it is still unclear whether CV is generally con- 
sistent and if the data splitting ratio has a sensitive effect on consistency. 
For successful applications of CV in practice, a theoretical understanding of 
these issues is very much of interest. 
In this paper, we address the aforementioned consistency issue and show 
that a voting-based cross validation is consistent for comparing general re- 
gression procedures when the data splitting ratio is properly chosen. 

In the context of linear regression, Shao's result [19] implies that with 
$k/n$ not converging to 1, delete-$k$ CV does not differentiate well between two
correct models. However, in nonparametric regression, it turns out that this 
is not the case. In fact, as long as at least one of the competing procedures 
converges at a nonparametric rate, the estimation size and evaluation size 
can be of the same order in data splitting, and sometimes the estimation 
size can even be the dominating one. 

In the settings where theoretical properties of CV were investigated be- 
fore, the best model (in the linear regression context) or the best smoothing 
parameter (in the case of kernel regression) exists in a natural way. When 
comparing two general estimators, the issue becomes more complicated. In 
this paper, we compare two estimators in terms of a loss function and the 
consistency in selection is established when one estimator is better than the 
other in that sense. 

The paper is organized as follows. In Section 2 we set up the problem. 
The main result is presented in Section 3, followed by simulation results in 
Section 4. Concluding remarks are in Section 5. The proof of the main result 
is in Section 6. 

2. Problem setup. Consider the regression setting 

$$Y_i = f(X_i) + \varepsilon_i, \quad 1 \le i \le n,$$

where $(X_i, Y_i)_{i=1}^{n}$ are independent observations with $X_i$ i.i.d. taking values
in a $d$-dimensional Borel set $\mathcal{X} \subset \mathbb{R}^d$ for some $d \ge 1$, $f$ is the true regression
function and $\varepsilon_i$ are the random errors with $E(\varepsilon_i | X_i) = 0$ and $E(\varepsilon_i^2 | X_i) < \infty$
almost surely. The distribution of $X_i$ is unknown.

Rates of convergence of various popular regression procedures have been 
well studied. Under the squared $L_2$ loss (and other closely related performance
measures as well), for parametric regression (assuming that the true
regression function has a known parametric form), estimators based on maximum
likelihood (if available) or least squares usually converge at the rate
$n^{-1}$. For nonparametric estimation, the convergence rates are slower than
$n^{-1}$ with the actual convergence rate depending on both the regression
procedure and the smoothness property of the true regression function.

In applications, with many regression procedures available to be applied, 
it is often challenging to find the one with best accuracy. One often faces 
the issue: Should I go parametric or nonparametric? Which nonparametric 
procedure to use? As is well known, nonparametric procedures have more 
flexibility yet converge suboptimally compared to estimation based on a 
correctly specified parametric model. Sometimes, for instance, there is a 
clear linear trend and naturally a simple linear model is a good candidate. 
It may be unclear, however, whether a nonparametric method can provide 
a better estimate to capture a questionable slight curvature seen in the 
data. For a specific example, for the rather famous Old Faithful Geyser data 
(Weisberg [31]), there were several analyses related to the comparison of 
linear regression with nonparametric alternatives (see, e.g., Simonoff [21] 
and Hart [13]). 

For simplicity, suppose that there are two regression procedures, say $\delta_1$
and $\delta_2$, that are considered. For example, $\delta_1$ may be simple linear regression
and $\delta_2$ may be a local polynomial regression procedure (see, e.g., Fan and
Gijbels [8]). For another example, $\delta_1$ may be a spline estimation procedure
(see, e.g., Wahba [29]) and $\delta_2$ may be a wavelet estimation procedure (see,
e.g., Donoho and Johnstone [7]). Based on a sample $(X_i, Y_i)_{i=1}^{n}$, the regression
procedures $\delta_1$ and $\delta_2$ yield estimators $\hat f_{n,1}(x)$ and $\hat f_{n,2}(x)$, respectively.
We need to select the better one of them.

Though for simplicity we assumed that there are only two competing 
regression procedures, similar results hold when a finite number of candidate 
regression procedures are in competition. We emphasize that our focus in 
this work is not on tuning a smoothing parameter of a regression procedure 
(such as the bandwidth for kernel regression). Rather, our result is general 
and applicable also for the case when the candidate regression procedures 
are very distinct, possibly with different rates of convergence for estimating 
the regression function. 

Cross validation is a natural approach to address the above model/procedure 
comparison issue. It has the advantage that it requires mild distributional as- 
sumptions on the data and it does not need to find characteristics such as de- 
grees of freedom or model dimension for each model/procedure (which is not 
necessarily easy for complicated adaptive nonparametric procedures or even 
parametric procedures where model selection is conducted). To proceed, we
split the data into two parts: the estimation data consist of $Z^1 = (X_i, Y_i)_{i=1}^{n_1}$
and the validation data consist of $Z^2 = (X_i, Y_i)_{i=n_1+1}^{n}$. Let $n_2 = n - n_1$. We
apply $\delta_1$ and $\delta_2$ on $Z^1$ to obtain the estimators $\hat f_{n_1,1}(x)$ and $\hat f_{n_1,2}(x)$, respectively.
Then we compute the prediction squared errors of the two estimators
on $Z^2$:

(2.1) $$\mathrm{CV}(\hat f_{n_1,j}) = \sum_{i=n_1+1}^{n} \bigl(Y_i - \hat f_{n_1,j}(X_i)\bigr)^2, \quad j = 1, 2.$$

If $\mathrm{CV}(\hat f_{n_1,1}) \le \mathrm{CV}(\hat f_{n_1,2})$, $\delta_1$ is selected and otherwise $\delta_2$ is chosen. We call
this a delete-$n_2$ CV. For investigating the consistency property of CV, in
this work we only consider the case when $\min(n_1, n_2) \to \infty$.
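
As an illustration of the single-split criterion (2.1), the following is a minimal Python sketch, not the implementation used in the paper. It assumes that procedure 1 is simple linear regression and that procedure 2 is a Nadaraya-Watson kernel smoother with a fixed bandwidth; the function names (`fit_linear`, `fit_kernel`, `cv_error`, `select_by_cv`) and the bandwidth value are hypothetical choices made only for this example.

```python
import math
import random


def fit_linear(xs, ys):
    """Least-squares simple linear regression; returns a predictor x -> a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_kernel(xs, ys, bandwidth=0.05):
    """Nadaraya-Watson smoother with a Gaussian kernel (bandwidth is an assumed value)."""
    def predict(x):
        weights = [math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in xs]
        total = sum(weights)
        if total == 0.0:  # all weights underflowed; fall back to the sample mean
            return sum(ys) / len(ys)
        return sum(w * y for w, y in zip(weights, ys)) / total
    return predict


def cv_error(fit, xs, ys, n1):
    """Criterion (2.1): fit on the first n1 points, sum squared errors on the rest."""
    predictor = fit(xs[:n1], ys[:n1])
    return sum((y - predictor(x)) ** 2 for x, y in zip(xs[n1:], ys[n1:]))


def select_by_cv(xs, ys, n1, fit_1=fit_linear, fit_2=fit_kernel):
    """Return 1 if procedure 1 has CV error <= that of procedure 2, and 2 otherwise."""
    error_1 = cv_error(fit_1, xs, ys, n1)
    error_2 = cv_error(fit_2, xs, ys, n1)
    return 1 if error_1 <= error_2 else 2
```

With `n1` estimation points and `len(xs) - n1` evaluation points, `select_by_cv(xs, ys, n1)` plays the role of the delete-$n_2$ selection rule; any pair of fitting functions with the same signature can be passed in place of the two defaults.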

A more balanced use of the data is CV with multiple data splittings. A 
voting-based version will be considered as well. 
3. Consistency of cross validation. In this section, we first define the 
concept of consistency when two general regression procedures are com- 
pared and then state the conditions that are needed for our result. Note 
that there are two types of consistency results on CV. One is from the selec- 
tion perspective (which concerns us if the true model or the best candidate 
procedure is selected with probability tending to 1) and the other is in terms 
of convergence of the resulting estimator of the regression function (either in 
probability or in risk under a proper loss function). The former is our focus 
in this work. Note that in general, neither notion guarantees the other. Nev- 
ertheless, for a global loss function, consistency in selection usually implies 
consistency in estimation of the regression function at least in probability. 
In contrast, consistency in estimation can be far away from consistency in 
selection. For example, with two nested models, if both models are correct, 
then reasonable selection rules will be consistent in estimation but not nec- 
essarily so in selection. See Wegkamp [30] for risk bounds for a modified CV 
(with an extra complexity penalty) for estimating the regression function. 

3.1. Definitions and conditions. We first give some useful definitions for 
comparing estimators in terms of probability. 

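Let $L(\theta, \hat\theta)$ be a loss function. Consider two estimation procedures $\delta$ and
$\delta'$ for estimating a parameter $\theta$. Let $\{\hat\theta_{n,1}\}_{n=1}^{\infty}$ and $\{\hat\theta_{n,2}\}_{n=1}^{\infty}$ be the corresponding
estimators when applying the two procedures at sample sizes 1,
2, ..., respectively.

Definition 1. Procedure $\delta$ (or $\{\hat\theta_{n,1}\}_{n=1}^{\infty}$, or simply $\hat\theta_{n,1}$) is asymptotically
better than $\delta'$ (or $\{\hat\theta_{n,2}\}_{n=1}^{\infty}$, or $\hat\theta_{n,2}$) under the loss function $L(\theta, \hat\theta)$ if
for every $0 < \epsilon < 1$, there exists a constant $c_\epsilon > 0$ such that when $n$ is large
enough,

(3.1) $$P\bigl(L(\theta, \hat\theta_{n,2}) \ge (1 + c_\epsilon) L(\theta, \hat\theta_{n,1})\bigr) \ge 1 - \epsilon.$$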

Remarks. 1. Suppose that $\hat\theta_{n,1}$ and $\tilde\theta_{n,1}$ are asymptotically equivalent
under the loss function $L(\theta, \hat\theta)$ in the sense that $L(\theta, \hat\theta_{n,1})/L(\theta, \tilde\theta_{n,1}) \to 1$ in
probability. If $\hat\theta_{n,1}$ is asymptotically better than $\hat\theta_{n,2}$, then obviously $\tilde\theta_{n,1}$ is
also asymptotically better than $\hat\theta_{n,2}$.

2. It seems clear that to evaluate a selection method that chooses between 
two procedures, the procedures need to be rankable. When two procedures 
are asymptotically equivalent, one may need to examine finer differences 
(e.g., higher order behavior) to compare them. When two procedures can 
have "ties" in the sense that on a set with a nonvanishing probability the 
two procedures are identical or behave the same, it becomes tricky to define 
consistency of selection generally speaking. 
The requirement in (3.1) is sensible. Obviously, if $L(\theta, \hat\theta_{n,2})/L(\theta, \hat\theta_{n,1}) \to \infty$
in probability, by the definition, $\delta$ is asymptotically better than $\delta'$ under
$L$. The concept is also useful for comparing procedures that converge at the
same order. In particular, it is worth pointing out that in the context of linear 
regression with an appropriate loss (e.g., global L2 loss), when two correct 
models are compared, typically the estimator based on the one with a lower 
dimension is asymptotically better. As is expected, in the same context, an 
incorrect subset model yields an asymptotically worse estimator than any 
one based on a correct model. 

We next define consistency of a model/procedure selection rule from the 
perspective of selecting the best (or better when there are only two candi- 
dates) model/procedure. Unless otherwise stated, consistency of CV refers 
to this notion in the rest of the paper. 

Definition 2. Assume that one of the candidate regression procedures,
say $\delta^*$, is asymptotically better than the other candidate procedures. A selection
rule is said to be consistent if the probability of selecting $\delta^*$ approaches
1 as $n \to \infty$.

This concept of consistency for a selection rule is more general than the 
definition that a selection rule is consistent if the true model (when existing 
and being considered) is selected with probability approaching 1. Obviously 
the latter does not apply for comparing two general regression procedures. 

Let $\{a_n\}$ be a sequence of positive numbers approaching zero. The following
simple definition concerns the rate of convergence in probability (cf.
Stone [23]).

Definition 3. A procedure $\delta$ (or $\{\hat\theta_n\}_{n=1}^{\infty}$) is said to converge exactly at
rate $\{a_n\}$ in probability under the loss $L$ if $L(\theta, \hat\theta_n) = O_p(a_n)$, and for every
$0 < \epsilon < 1$, there exists $c_\epsilon > 0$ such that when $n$ is large enough,
$P(L(\theta, \hat\theta_n) \ge c_\epsilon a_n) \ge 1 - \epsilon$.

Clearly, the latter part of the condition in the above definition says that
the estimator does not converge faster than the rate $a_n$.
Define the $L_q$ norm

$$\|f\|_q = \begin{cases} \left(\int |f(x)|^q P_X(dx)\right)^{1/q}, & \text{for } 1 \le q < \infty, \\ \operatorname{ess\,sup} |f|, & \text{for } q = \infty, \end{cases}$$

where $P_X$ denotes the probability distribution of $X_1$.

We assume that $\hat f_{n,1}$ converges exactly at rate $p_n$ in probability and $\hat f_{n,2}$
converges exactly at rate $q_n$ in probability under the $L_2$ loss. Note that $p_n$
and $q_n$ may or may not converge at the same rate.
Condition 0 (Error variances). The error variances $E(\varepsilon_i^2 | X_i)$ are upper
bounded by a constant $\sigma^2 > 0$ almost surely for all $i \ge 1$.

This is a mild condition on the errors, which does not require them to be 
identically distributed. We also need to control the sup-norm of the estima- 
tors. 

Condition 1 (Sup-norm of the estimators). There exists a sequence of
positive numbers $A_n$ such that for $j = 1, 2$, $\|f - \hat f_{n,j}\|_\infty = O_p(A_n)$.

The condition almost always holds. But for our main result to be helpful,
the constants $A_n$ need to be suitably controlled.

We mention some useful sufficient conditions that imply Condition 1. One
is that for $j = 1, 2$, $\|f - \hat f_{n,j}\|_\infty$ is bounded in probability. Another stronger
condition is that $E\|f - \hat f_{n,j}\|_\infty$ is uniformly bounded in $n$ for $j = 1, 2$. It
clearly holds if with probability 1, $\|f - \hat f_{n,j}\|_\infty$ is upper bounded by a constant
$A > 0$, which is satisfied if the true regression function is bounded
between two known constants and the estimators are accordingly restricted.

In order to have consistency in selection, we need that one procedure is 
better than the other. 

Condition 2 (One procedure being better). Under the $L_2$ loss, either $\delta_1$
is asymptotically better than $\delta_2$, or $\delta_2$ is asymptotically better than $\delta_1$.

Clearly there are situations where neither of two competing procedures is
asymptotically better than the other. In such a case, the concept of consistency
is hard to define (or may be irrelevant). Note that under Condition 2,
if $\delta_1$ is asymptotically better than $\delta_2$, then we have $p_n = O(q_n)$. Clearly,
if there exists a constant $C > 1$ such that with probability approaching 1,
$\|f - \hat f_{n,2}\|_2 \ge C \|f - \hat f_{n,1}\|_2$, or even $\|f - \hat f_{n,2}\|_2 / \|f - \hat f_{n,1}\|_2 \to \infty$ in probability,
then Condition 2 is satisfied with $\delta_1$ being better.

Another quantity, namely, $\|f - \hat f_{n,j}\|_4 / \|f - \hat f_{n,j}\|_2$, is involved in our analysis.
Obviously, the ratio is lower bounded by 1 and there cannot be any
general positive upper bound. For various estimators, the ratio can be controlled.
We need the following condition.
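
To make the norm ratio concrete, here is a small Python sketch that approximates the $L_2$, $L_4$ and sup norms of an error function on a grid, taking the design distribution to be uniform on [0, 1]. The two error functions, the grid size and the helper names (`lq_norm`, `norm_ratio`) are assumptions made for illustration only; they are not taken from the paper.

```python
import math


def lq_norm(values, q):
    """Empirical L_q norm under the uniform measure on the evaluation points."""
    if q == math.inf:
        return max(abs(v) for v in values)
    return (sum(abs(v) ** q for v in values) / len(values)) ** (1.0 / q)


def norm_ratio(values):
    """Ratio of the L_4 norm to the L_2 norm, the quantity bounded in Condition 3."""
    return lq_norm(values, 4) / lq_norm(values, 2)


# Midpoint grid on [0, 1], standing in for a uniform design distribution.
grid = [(i + 0.5) / 1000 for i in range(1000)]

# A spread-out error and a sharply localized error (both hypothetical).
smooth_error = [0.1 * math.sin(2 * math.pi * x) for x in grid]
spiky_error = [math.exp(-200 * (x - 0.25) ** 2) for x in grid]
```

On such a grid the ordering of the $L_2$, $L_4$ and sup norms holds exactly, so `norm_ratio` is never below 1, and the localized error has the larger ratio of the two. This is the kind of behavior that the sequence $M_n$ is meant to control.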

Condition 3 (Relating $L_4$ and $L_2$ losses). There exists a sequence of
positive numbers $\{M_n\}$ such that for $j = 1, 2$, $\|f - \hat f_{n,j}\|_4 / \|f - \hat f_{n,j}\|_2 = O_p(M_n)$.

For many familiar infinite-dimensional classes of regression functions, the 
optimal rates of convergence under L4 and L2 are the same. If we consider 
an optimal estimator (in rate) under $L_4$, then we can take $M_n$ to be 1
for a typical $f$ in such a function class. Parametric estimators typically
have the ratio upper bounded in probability (i.e., $M_n = 1$) under some mild
conditions.

For some nonparametric estimators, the sup-norm risk is often of only
a slightly higher order than that under $L_p$ for $p < \infty$. For example, for
Hölder classes $\Sigma(\beta, L) = \{f : |f^{(m)}(x) - f^{(m)}(y)| \le L|x - y|^{\alpha}\}$, where $m = \lfloor\beta\rfloor$
is an integer, $0 < \alpha \le 1$ and $\alpha = \beta - m$; also $\|f\|_\infty$ is bounded, or for
Sobolev classes, the rates of convergence under the sup-norm distance and
$L_p$ ($p < \infty$) are different only by a logarithmic factor (see, e.g., Stone [24] and
Nemirovski [16]). If one takes an optimal or near-optimal estimator under
the $L_\infty$ loss, Condition 3 is satisfied typically with $M_n$ being a logarithmic
term.



3.2. The main theorem. Let $I^* = 1$ if $\delta_1$ is asymptotically better than $\delta_2$
and $I^* = 2$ if $\delta_2$ is asymptotically better than $\delta_1$. Let $\hat I_n = 1$ if
$\mathrm{CV}(\hat f_{n_1,1}) \le \mathrm{CV}(\hat f_{n_1,2})$ and otherwise $\hat I_n = 2$.

Theorem 1. Under Conditions 0-3, if the data splitting satisfies
(1) $n_2 \to \infty$ and $n_1 \to \infty$; (2) $n_2 M_{n_1}^{-4} \to \infty$; and (3)

$$\sqrt{n_2}\, \max(p_{n_1}, q_{n_1}) / (1 + A_{n_1}) \to \infty,$$

then the delete-$n_2$ CV is consistent, that is, $P(\hat I_n \ne I^*) \to 0$ as $n \to \infty$.

Remarks. 1. The third requirement above is equivalent to $\sqrt{n_2} \max(p_{n_1}, q_{n_1}) \to \infty$
and $\sqrt{n_2} \max(p_{n_1}, q_{n_1}) / A_{n_1} \to \infty$. The latter has no effect, for example,
when the estimators being compared by CV both converge in the
sup-norm in probability. The effect of $M_{n_1}$ is often more restrictive than
$A_{n_1}$ and sometimes can be more complicated to deal with.

2. Theorem 1 is stated under a single data generating distribution. Obvi- 
ously, model selection becomes useful when there are various possible data 
generating mechanisms (e.g., corresponding to different parametric or non- 
parametric families of regression functions or different assumptions on the 
errors) that are potentially suitable for the data at hand. In the linear re- 
gression context, a number of finite-dimensional models is considered, and 
in the literature a model selection rule is said to be consistent when the true 
model is selected with probability going to 1 no matter what the true model 
is (assumed to be among the candidates). Clearly our theorem can be used 
to get such a result when multiple scenarios of the data generating process 
are possible. To that end, one just needs to find a data splitting ratio that 
works for all the scenarios being considered. 



CONSISTENCY OF CROSS VALIDATION 



3. A potentially serious disadvantage of cross validation is that when 
two candidate regression procedures are hard to distinguish, the forced ac- 
tion of choosing a single winner can substantially damage the accuracy of 
estimating the regression function. An alternative is to average the esti- 
mates. See Yang [33, 34] for references and theoretical results on combining 
models/procedures and simulation results that compare CV and a model
combining procedure, Adaptive Regression by Mixing (ARM).

It should be pointed out that the conclusion is sharp in the sense that 
there are cases in which the sufficient conditions on data splitting are also 
necessary. See Section 3.4 for details. 

The "ideal" norm conditions are that $M_n = O(1)$ and $A_n = O(1)$. Since
almost always $p_n$ and $q_n$ are at least of order $n^{-1/2}$, the most stringent third
requirement then is $n_2/n_1 \to \infty$.

Corollary 1. Under the same conditions as in Theorem 1, if $M_n = O(1)$
and $A_n = O(1)$, then the delete-$n_2$ CV is consistent for each of the
following two cases:

(i) $\max(p_n, q_n) = O(n^{-1/2})$, with the choice $n_1 \to \infty$ and $n_2/n_1 \to \infty$;

(ii) $\max(p_n, q_n)\, n^{1/2} \to \infty$, with any choice such that $n_1 \to \infty$ and
$n_1/n_2 = O(1)$.

From the corollary, if the $L_q$ norms of $\hat f_{n,1} - f$ and $\hat f_{n,2} - f$ behave "nicely"
for $q = 2, 4$ and $\infty$, then when at least one of the regression procedures
converges at a nonparametric rate under the $L_2$ loss, any split ratio in CV
works for consistency as long as both sizes tend to infinity and the estimation
size is no bigger (in order) than the evaluation size. Note that this splitting
requirement is sufficient but not necessary. For example, if $\max(p_n, q_n) = O(n^{-1/4})$,
then the condition $\sqrt{n_2} \max(p_{n_1}, q_{n_1}) \to \infty$ becomes $n_2/n_1^{1/2} \to \infty$
[i.e., $n_1 = o(n_2^2)$]. Thus, for example, we can take $n_1 = n - \lfloor \sqrt{n} \log n \rfloor$
and $n_2 = \lfloor \sqrt{n} \log n \rfloor$, in which case the estimation proportion is dominating.
This is in sharp contrast to the requirement of a much larger evaluation
size ($n_2/n_1 \to \infty$) when both of the regression procedures converge at the
parametric rate $n^{-1}$, which was discovered by Shao [19] in the context of
linear regression with fixed design. From Shao's results, one may expect
that when $p_n$ and $q_n$ are of the same order, the condition $n_2/n_1 \to \infty$ may
also be needed for consistency more generally. But, interestingly, our result
shows that this is not the case. Thus there is a paradigm shift in terms of
splitting proportion for CV when at least one of the regression procedures
is nonparametric.
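
The arithmetic behind the example split can be checked numerically. The sketch below assumes the rate $n^{-1/4}$ and the split $n_2 = \lfloor \sqrt{n} \log n \rfloor$, $n_1 = n - n_2$ from the paragraph above; the function names `split_sizes` and `diagnostics` are hypothetical. It reports $\sqrt{n_2}\, n_1^{-1/4}$, which should grow without bound, together with $n_2/n_1$, which should shrink to zero.

```python
import math


def split_sizes(n):
    """Split with a dominating estimation part: n2 = floor(sqrt(n) * log(n)), n1 = n - n2."""
    n2 = math.floor(math.sqrt(n) * math.log(n))
    return n - n2, n2


def diagnostics(n, rate_exponent=0.25):
    """Return (sqrt(n2) * n1**(-rate_exponent), n2 / n1) for the split above."""
    n1, n2 = split_sizes(n)
    growth = math.sqrt(n2) * n1 ** (-rate_exponent)
    return growth, n2 / n1


if __name__ == "__main__":
    for n in (10**3, 10**4, 10**5, 10**6, 10**7):
        growth, ratio = diagnostics(n)
        print(f"n={n:>9d}  sqrt(n2)*n1^(-1/4)={growth:.3f}  n2/n1={ratio:.4f}")
```

The first quantity grows only like $\sqrt{\log n}$, so the divergence is slow; the table is meant to show the direction of the two sequences rather than to suggest a practical split size.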

The result also suggests that in general delete-1 (or delete a small fraction)
is not geared toward finding out which candidate procedure is the better one. 
Note that for various estimators based on local averaging or series expansion,
we do not necessarily have the "ideal" norm requirement in Condition 3
met with $M_n = O(1)$, but may have $\|f - \hat f_n\|_4 \le \|f - \hat f_n\|_\infty \le a_n \|f - \hat f_n\|_2$
for some deterministic sequence $a_n$ (possibly converging to $\infty$ at a polynomial
order $n^{\gamma}$ with $\gamma > 0$). Then for applying the theorem, we may need
$n_2/n_1^{4\gamma} \to \infty$. This may or may not add any further restriction on the data
splitting ratio in CV beyond the other requirements in the theorem, depending
on the value of $\gamma$ in relation to $p_n$, $q_n$ and $A_n$.

3.3. CV with multiple data splittings. For Theorem 1, the data splitting 
in CV is done only once (and hence the name cross validation may not be 
appropriate there). Clearly, the resulting estimator depends on the order of 
the observations. In real applications, one may do any of the following: (1) 
consider all possible splits with the same ratio (this is called multifold CV; 
see, e.g., Zhang [35]); (2) the same as in (1) but consider only a sample of all 
possible splits (this is called repeated learning-testing; see, e.g., Burman [4]); 
(3) divide the data into r subgroups and do prediction one at a time for each 
subgroup based on estimation using the rest of the subgroups (this is called 
r-fold CV; see Breiman, Friedman, Olshen and Stone [3]). When multiple 
splittings are used, there are two natural ways to proceed. One is to first 
average the prediction errors over the different splittings and then select the 
procedure that minimizes the average prediction error. Another is to count 
the number of times each candidate is preferred under the different split- 
tings and then the candidate with the highest count is the overall winner. 
Such a voting is natural to consider for model selection. To make a distinc- 
tion, we call the former (i.e., CV with averaging) CV-a and the latter (i.e., 
CV with voting) CV-v. When focusing on linear models with fixed design, 
theoretical properties for multifold CV-a or r-fold CV-a were derived under 
assumptions on the design matrix (see, e.g., Zhang [35] and Shao [19]). We 
next show that under the same conditions used for a single data splitting 
in the previous subsection, CV-v based on multiple data splittings is also 
consistent in selection when the observations are identically distributed. 

Let $\pi$ denote a permutation of the observations. Let $\mathrm{CV}_\pi(\hat f_{n_1,j})$ be the
criterion value as defined in (2.1) except that the data splitting is done
after the permutation $\pi$. If $\mathrm{CV}_\pi(\hat f_{n_1,1}) \le \mathrm{CV}_\pi(\hat f_{n_1,2})$, then let $\tau_\pi = 1$ and
otherwise let $\tau_\pi = 0$.

Let $\Pi$ denote the set of all $n!$ permutations of the observations. If
$\sum_{\pi \in \Pi} \tau_\pi \ge n!/2$, then we select $\delta_1$, and otherwise $\delta_2$ is selected.

Theorem 2. Under the conditions in Theorem 1 and that the observa- 
tions are independent and identically distributed, the CV-v procedure above 
is consistent in selection. 
Proof. Without loss of generality, assume that $\delta_1$ is better than $\delta_2$.
Let $W$ denote the values of $(X_1, Y_1), \ldots, (X_n, Y_n)$ (ignoring the orders). Under
the i.i.d. assumption on the observations, obviously, conditional on $W$,
every ordering of these values has exactly the same probability and thus
$P(\mathrm{CV}(\hat f_{n_1,1}) \le \mathrm{CV}(\hat f_{n_1,2}))$ is equal to

$$E\, P\bigl(\mathrm{CV}(\hat f_{n_1,1}) \le \mathrm{CV}(\hat f_{n_1,2}) \,\big|\, W\bigr) = E\left(\frac{\sum_{\pi \in \Pi} \tau_\pi}{n!}\right).$$

From Theorem 1, under the given conditions, $P(\mathrm{CV}(\hat f_{n_1,1}) \le \mathrm{CV}(\hat f_{n_1,2})) \to 1$.
Thus $E(\sum_{\pi \in \Pi} \tau_\pi / n!) \to 1$. Since $\sum_{\pi \in \Pi} \tau_\pi / n!$ is between 0 and 1, for its
expectation to converge to 1, we must have $\sum_{\pi \in \Pi} \tau_\pi / n! \to 1$ in probability.
Consequently $P(\sum_{\pi \in \Pi} \tau_\pi \ge n!/2) \to 1$. This completes the proof of Theorem 2. □

From the above proof, it is clearly seen that the consistency result also 
holds for the aforementioned voting-based repeated learning-testing and r- 
fold CV-v methods. 
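
A minimal Python sketch of repeated learning-testing with both voting (CV-v) and averaging (CV-a) is given below. It is an illustration under stated assumptions rather than the procedure used in the paper: the random splits are drawn by shuffling, the two example fitters (`fit_mean` and `fit_line`) are hypothetical stand-ins for the competing procedures, and ties in the vote go to procedure 1, mirroring the rule $\sum_{\pi} \tau_\pi \ge n!/2$ with the number of sampled splits in place of $n!$.

```python
import random


def fit_mean(xs, ys):
    """Intercept-only model: always predicts the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Least-squares simple linear regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return lambda x: mean_y + slope * (x - mean_x)


def split_error(fit, data, n1):
    """Fit on the first n1 pairs and sum squared prediction errors on the rest."""
    train, test = data[:n1], data[n1:]
    predictor = fit([x for x, _ in train], [y for _, y in train])
    return sum((y - predictor(x)) ** 2 for x, y in test)


def repeated_cv(data, n1, fit_1, fit_2, n_splits=50, seed=0):
    """Return (CV-v choice, CV-a choice), each equal to 1 or 2."""
    rng = random.Random(seed)
    votes_1 = 0
    total_1 = total_2 = 0.0
    for _ in range(n_splits):
        shuffled = data[:]
        rng.shuffle(shuffled)  # one random permutation, i.e. one data splitting
        error_1 = split_error(fit_1, shuffled, n1)
        error_2 = split_error(fit_2, shuffled, n1)
        if error_1 <= error_2:
            votes_1 += 1
        total_1 += error_1
        total_2 += error_2
    cv_v = 1 if votes_1 >= n_splits / 2 else 2  # majority vote over splittings
    cv_a = 1 if total_1 <= total_2 else 2       # compare summed prediction errors
    return cv_v, cv_a
```

Here `data` is a list of `(x, y)` pairs. CV-v keeps only the winner of each splitting, whereas CV-a compares the accumulated errors, so the two choices can differ in finite samples even when both rules are consistent.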

Note that for the CV-a methods, for each candidate procedure, we average
$\mathrm{CV}(\hat f_{n_1,j})$ over the different data splittings first and then compare
the criterion values to select a winner. Intuitively, since the CV-v only
keeps the ranking of the procedures for each splitting, it may have lost
some useful information in the data. If so, the CV-v may perform worse
than CV-a. However, in terms of the consistency property in selection, it
is unlikely that the two versions of CV are essentially different. Indeed,
if $P\bigl(\sum_{\pi \in \Pi} (\mathrm{CV}_\pi(\hat f_{n_1,2}) - \mathrm{CV}_\pi(\hat f_{n_1,1})) > 0\bigr) \to 1$, due to symmetry, one expects
that for the majority of splittings in $\Pi$, with high probability, we have
$\mathrm{CV}_\pi(\hat f_{n_1,2}) - \mathrm{CV}_\pi(\hat f_{n_1,1}) > 0$. Then
$P\bigl(\sum_{\pi \in \Pi} I_{\{\mathrm{CV}_\pi(\hat f_{n_1,2}) - \mathrm{CV}_\pi(\hat f_{n_1,1}) > 0\}} > n!/2\bigr)$
is close to 1.

Based on the above reasoning, we conjecture that the two CV methods 
generally share the same status of consistency in selection (i.e., if one is 
consistent so is the other). This of course does not mean that they perform 
similarly at a finite sample size. In the simulations reported in Section 4, 
we see that for comparing two converging parametric procedures, CV-v per- 
forms much worse than CV-a, but when CV-a selects the best procedure 
with probability closer to 1, their difference becomes small; for comparing 
a converging nonparametric procedure with a converging parametric one, 
CV-v often performs better. We tend to believe that the differences between 
CV-v and CV-a are second-order effects, which are hard to quantify in gen- 
eral. 



3.4. Are the sufficient conditions on data splitting also necessary for con- 
sistency? Theorem 2 shows that if a single data splitting ensures consis- 
tency, voting based on multiple splittings also works. In the reverse direction, 



Y. YANG 



one may wonder if multiple splittings with averaging can rescue an inconsis- 
tent CV selection with only one splitting. We give a counterexample below. 

In the following example, we consider two simple models which allow
us to exactly identify the sufficient and necessary conditions for ensuring
consistency in selection by a CV-a method. Model 1 is $Y_i = \varepsilon_i$, $i = 1, \ldots, n$,
where $\varepsilon_i$ are i.i.d. normal with mean zero and variance $\sigma^2$. Model 2 is
$Y_i = \mu + \varepsilon_i$, $i = 1, \ldots, n$, with the same conditions on the errors. Clearly, model 1
is a submodel of model 2. Under model 1, obviously $\hat f_{n,1}(x) = 0$, and under
model 2, the maximum likelihood method yields $\hat f_{n,2}(x) = \bar Y$. We consider
the multifold CV with splitting ratio $n_1 : n_2$, where again $n_1$ is the estimation
sample size and $n_2$ is the evaluation size. Note that for the multifold CV-a
method, we consider all possible data splittings at the given ratio and the
averaging of the prediction errors is done over the splittings. Let $S$ denote
the observations in the estimation set. Then

$$\mathrm{CV}(j) = \sum_{S} \left( \sum_{i \in S^c} (Y_i - \hat Y_{S,j})^2 \right),$$

where the sum over $S$ is over all possible data splittings with the given ratio, $S^c$ denotes
the complement of $S$, $\hat Y_{S,1} = 0$ and $\hat Y_{S,2} = \frac{1}{n_1} \sum_{l \in S} Y_l$.

Proposition 1. A sufficient and necessary condition for the above CV
to be consistent in selection is that the data splitting ratio satisfies (1) $n_1 \to \infty$,
(2) $n_2/n_1 \to \infty$.

Note that the conditions in this proposition match those sufficient condi- 
tions in Theorem 1. Consequently, we know that for the consistency prop- 
erty, multiple splittings (or even all possible splittings) do not help in this 
case. Although it seems technically difficult to derive a similar general re- 
sult, we tend to believe that the example represents the typical situation of 
comparing nested parametric models. 

In the context of variable selection in linear regression, Zhang [35] showed 
that for any fixed splitting ratio, none of multifold CV, repeated learning- 
testing and r-fold CV based on averaging is consistent in selection. 

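Proof of Proposition 1. With some calculations, under the larger model, we
have $\mathrm{CV}(1) = \binom{n-1}{n_1}\bigl(n\mu^2 + 2\mu\sum_{i=1}^n
\varepsilon_i + \sum_{i=1}^n \varepsilon_i^2\bigr)$, while $\mu$ cancels in
$Y_i - \hat Y_{i,2}$, so that $\mathrm{CV}(2) = \sum_S \sum_{i \in S^c}
\bigl(\varepsilon_i - \frac{1}{n_1}\sum_{l \in S} \varepsilon_l\bigr)^2$.
Then with more simplifications, we get that $\mathrm{CV}(1) - \mathrm{CV}(2)$
equals

\[
\binom{n-1}{n_1}\Bigl(n\mu^2 + 2\mu\sum_{i=1}^n \varepsilon_i\Bigr)
+ \frac{(n-2)!\,(n_1+1)}{n_1\, n_1!\,(n_2-1)!}\Bigl(\sum_{i=1}^n \varepsilon_i\Bigr)^2
- \frac{(n-2)!\,(n_1+n)}{n_1\, n_1!\,(n_2-1)!}\sum_{i=1}^n \varepsilon_i^2 .
\]

When $\mu \neq 0$, it is not hard to show that $\mathrm{CV}(1) -
\mathrm{CV}(2)$ is positive with probability tending to 1 if and only if
$n_1 \to \infty$. When model 1 holds, that is, $\mu = 0$, $\mathrm{CV}(1) -
\mathrm{CV}(2) > 0$ is equivalent to
$\bigl(n_1 - \frac{n_1}{n}\bigr)\bigl(\sum_{i=1}^n \varepsilon_i\bigr)^2 >
(n + n_1)\sum_{i=1}^n (\varepsilon_i - \bar\varepsilon)^2$.
Since $\sum_{i=1}^n \varepsilon_i$ is independent of $\sum_{i=1}^n
(\varepsilon_i - \bar\varepsilon)^2$ and they have normal and chi-square
distributions, respectively, we know that for the probability of the event
to go to zero, we must have $n_1/n \to 0$, which is also sufficient.
This completes the proof of the proposition. □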
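The closed form in the proof can be checked numerically on a small sample.
The sketch below is an illustration only and is not part of the original
paper; the function names are hypothetical. It relies on the fact that the
same counting argument, applied to the observations directly, gives
$\mathrm{CV}(1) - \mathrm{CV}(2) = \binom{n-2}{n_1-1} n_1^{-2}
\bigl[(n_1+1)(\sum Y_i)^2 - (n+n_1)\sum Y_i^2\bigr]$. It compares this with
brute-force enumeration of all splits, and then estimates by Monte Carlo how
often the larger model is wrongly chosen when model 1 is true.

```python
import itertools
import math
import random


def cv_difference_brute_force(y, n1):
    """CV(1) - CV(2), summing over every estimation set S of size n1."""
    n = len(y)
    total = 0.0
    for est in itertools.combinations(range(n), n1):
        est_set = set(est)
        mean_est = sum(y[i] for i in est) / n1  # model 2 predicts the mean of S
        for i in range(n):
            if i not in est_set:
                # model 1 predicts 0, so its squared error is y[i] ** 2
                total += y[i] ** 2 - (y[i] - mean_est) ** 2
    return total


def cv_difference_closed_form(y, n1):
    """Same quantity from sum(y) and sum(y^2); requires 1 <= n1 <= n - 1."""
    n = len(y)
    t = sum(y)
    q = sum(v * v for v in y)
    k = math.comb(n - 2, n1 - 1) / n1 ** 2
    return k * ((n1 + 1) * t * t - (n + n1) * q)


def wrong_selection_rate(n, n1, reps, rng):
    """Monte Carlo estimate of P(CV(1) > CV(2)) when model 1 (mu = 0) holds."""
    wrong = 0
    for _ in range(reps):
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]
        t = sum(y)
        q = sum(v * v for v in y)
        if (n1 + 1) * t * t > (n + n1) * q:  # sign of the closed form
            wrong += 1
    return wrong / reps


if __name__ == "__main__":
    rng = random.Random(0)
    y = [rng.gauss(0.5, 1.0) for _ in range(7)]
    print(cv_difference_brute_force(y, 3), cv_difference_closed_form(y, 3))
    # fixed ratio 1:1 versus a vanishing estimation proportion
    print(wrong_selection_rate(400, 200, 1000, rng))
    print(wrong_selection_rate(400, 20, 1000, rng))
```

With a fixed ratio such as $n_1 = n/2$ the estimated error rate should stay
bounded away from zero as $n$ grows, whereas with $n_1/n \to 0$ it should
shrink, in line with condition (2) of Proposition 1.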

4. Simulation. In this section, we present some simulation results that
are helpful for understanding the differences among several versions of CV
and the effect of the data splitting ratio.

We consider estimating a regression function on [0, 1] with three competing
regression methods. The true model is $Y_i = f(X_i) + \varepsilon_i$,
$1 \le i \le n$, where the $X_i$ are i.i.d. uniform in the unit interval, and
the errors are independent of the $X_i$ and are i.i.d. normal with mean zero
and standard deviation $\sigma = 0.3$. The true function is taken to be one
of the three functions

(Case 1) $f_1(x) = 1 + x$,

(Case 2) $f_2(x) = 1 + x + 0.7(x - 0.5)^2$,

(Case 3) $f_3(x) = 1 + x - \exp(-200(x - 0.25)^2)$.

In all these cases, for $\sigma$ not too large, a linear trend in the scatter
plot is more or less obvious, but when $\sigma$ is not small, it is usually
not completely clear whether the true function is simply linear or not. We
consider three regression methods: simple linear regression, quadratic
regression and smoothing spline. The simulation was conducted using R, where
a smoothing spline method is provided. We take the default choice of GCV for
smoothing parameter selection.
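A minimal sketch of this data-generating setup is given here in Python (the
paper's own simulation used R, so the code and all names in it are
illustrative assumptions). It also computes how far each true function
departs from the linear trend $1 + x$ on a grid, which indicates why the
nonlinearity is hard to see in Case 2 at $\sigma = 0.3$ and easy to see in
Case 3.

```python
import math
import random

SIGMA = 0.3  # error standard deviation stated in the text

# The three true regression functions (Cases 1-3).
TRUE_FUNCTIONS = {
    1: lambda x: 1.0 + x,
    2: lambda x: 1.0 + x + 0.7 * (x - 0.5) ** 2,
    3: lambda x: 1.0 + x - math.exp(-200.0 * (x - 0.25) ** 2),
}


def simulate(case, n, rng):
    """Draw n pairs (X_i, Y_i) with X_i uniform on [0, 1] and normal errors."""
    f = TRUE_FUNCTIONS[case]
    xs = [rng.random() for _ in range(n)]
    ys = [f(x) + rng.gauss(0.0, SIGMA) for x in xs]
    return xs, ys


def max_departure_from_line(case, grid_size=1000):
    """Largest |f(x) - (1 + x)| over an equally spaced grid on [0, 1]."""
    f = TRUE_FUNCTIONS[case]
    grid = [i / grid_size for i in range(grid_size + 1)]
    return max(abs(f(x) - (1.0 + x)) for x in grid)


if __name__ == "__main__":
    xs, ys = simulate(2, 100, random.Random(0))
    print(len(xs), len(ys))
    for case in (1, 2, 3):
        print(case, max_departure_from_line(case))
```

In Case 2 the quadratic term contributes at most 0.175, which is below one
error standard deviation, while in Case 3 the dip near $x = 0.25$ has depth 1.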

Clearly, in Case 1, the linear regression is the winner; the quadratic
regression is the right one in Case 2 (when the sample size is reasonably
large); and the smoothing spline method is the winner in Case 3. Case 1 is
the most difficult in comparing the methods because all of the three
estimators converge to the true regression function with two converging at
the same parametric rate. Case 2 is easier, where the simple linear estimator
no longer converges and consequently the task is basically to compare the
parametric and the nonparametric estimators (note that the spline estimator
converges at a slower order). Case 3 is the easiest, where only the smoothing
spline estimator is converging.

We consider the following versions of CV: (1) single splitting CV: the data
splitting is done just once; (2) repeated learning-testing (a version of
CV-a), denoted as RLT: we randomly split the data into the estimation and
evaluation parts 100 times and average the prediction errors over the
splittings for each estimator; (3) repeated splittings with voting (a version
of CV-v), denoted as RSV: differently from the previous one, we select the
best method based on each data splitting and then vote to decide the overall
winner. The sample sizes considered are 100, 200, 400, 800 and 1600. In
Case 2, the first one or two sample sizes are not considered for the
splitting ratios 3 : 7 and 1 : 9 because the smoothing spline method has
difficulty in parameter estimation due to the small sample size in the
estimation part. In Case 3, only the splitting ratios 9 : 1 and 5 : 5 are
included because lower splitting ratios make the CV methods perform
perfectly. Note also that the first two sample sizes are not included for the
splitting ratio 5 : 5 for the same reason as mentioned above.

The results based on 200 replications are presented in Figures 1-3 for the
three cases.
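The three selection rules can be sketched as follows. This is an illustrative
Python sketch under stated assumptions, not the code used for the paper: the
smoothing spline with GCV is replaced by a Gaussian kernel smoother with a
fixed, hypothetical bandwidth, the single-split choice is taken from the
first of the random splits, and all function names are invented for the
example.

```python
import math
import random


def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations; returns a predictor."""
    m = degree + 1
    a = [[sum(x ** (j + k) for x in xs) for k in range(m)] for j in range(m)]
    b = [sum(y * x ** j for x, y in zip(xs, ys)) for j in range(m)]
    for col in range(m):  # Gaussian elimination with partial pivoting
        piv = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m):
            factor = a[r][col] / a[col][col]
            for k in range(col, m):
                a[r][k] -= factor * a[col][k]
            b[r] -= factor * b[col]
    coef = [0.0] * m
    for r in reversed(range(m)):  # back substitution
        s = b[r] - sum(a[r][k] * coef[k] for k in range(r + 1, m))
        coef[r] = s / a[r][r]
    return lambda x: sum(c * x ** p for p, c in enumerate(coef))


def fit_kernel(xs, ys, bandwidth=0.04):
    """Nadaraya-Watson smoother; a stand-in for the smoothing spline (assumption)."""
    def predict(x):
        w = [math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in xs]
        return sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)
    return predict


METHODS = {
    "linear": lambda xs, ys: fit_polynomial(xs, ys, 1),
    "quadratic": lambda xs, ys: fit_polynomial(xs, ys, 2),
    "kernel": fit_kernel,
}


def split_errors(xs, ys, n1, rng):
    """One random split: fit on n1 points, mean squared prediction error on the rest."""
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    est, ev = idx[:n1], idx[n1:]
    errors = {}
    for name, fit in METHODS.items():
        predict = fit([xs[i] for i in est], [ys[i] for i in est])
        errors[name] = sum((ys[i] - predict(xs[i])) ** 2 for i in ev) / len(ev)
    return errors


def select_methods(xs, ys, n1, n_splits, rng):
    """Return the choices of (single split, RLT by averaging, RSV by voting)."""
    totals = {name: 0.0 for name in METHODS}
    votes = {name: 0 for name in METHODS}
    single = None
    for s in range(n_splits):
        errors = split_errors(xs, ys, n1, rng)
        winner = min(errors, key=errors.get)
        if s == 0:
            single = winner
        votes[winner] += 1
        for name in METHODS:
            totals[name] += errors[name]
    return single, min(totals, key=totals.get), max(votes, key=votes.get)
```

RLT compares the averaged prediction errors, while RSV counts how often each
method wins a split, so the two rules can disagree when one method wins often
but by small margins.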

From the graphs, we observe the following: 

1. In Case 1, at a given splitting ratio, the increase of sample size does
not lead to improvement on correct identification of the best estimator. This
nicely matches what is expected from Theorem 1: since the simple linear and
the quadratic regression estimators both converge at the parametric rate,
no matter how large the sample size is, any fixed proportion is not
sufficient for consistency in selection.

2. In Case 2, at a given splitting ratio, when the sample size is increased,
the ability of CV to identify the best estimator tends to be enhanced. This
is also consistent with Theorem 1: in this case, the two converging
estimators converge at different rates and thus a fixed splitting ratio is
sufficient to ensure consistency in selection.

3. In Case 3, even at splitting ratio 9 : 1, RLT and RSV have little
difficulty in finding the best estimator. From Theorem 1, in this case, since
the smoothing spline estimator is the only converging one, the splitting
ratio $n_1 : n_2$ is even allowed to go to $\infty$ without sacrificing the
property of consistency in selection.

4. The different versions of CV behave quite differently. Overall, RLT
seems to be the best, while the single splitting is clearly inferior. In
Case 1, for the splitting ratios that favor the estimation size, RSV did
poorly. However, with higher and higher splitting ratio toward the evaluation
part, it improved dramatically. This is consistent with the understanding
that for a sequence of consistent splitting ratios, CV with voting and CV
with averaging do not differ asymptotically (although their second-order
behavior may be different). In Cases 2 and 3, the CV-v actually performed
similarly to or better than the CV-a [the difference is large in Case 2 when
the estimation sample size is small (40, 50, 60)].

Fig. 1. Probability of selecting the best estimator for Case 1. Panels:
splitting ratios 9:1, 5:5, 3:7 and 1:9. Coding: ■ = repeated
learning-testing, + = repeated sampling with voting, × = single splitting.
The sample sizes are 100, 200, 400, 800 and 1600.

Fig. 2. Probability of selecting the best estimator for Case 2. Panels:
splitting ratios 9:1, 5:5, 3:7 and 1:9; horizontal axis: different sample
sizes. Coding: ■ = repeated learning-testing, + = repeated sampling with
voting, × = single splitting. The sample sizes are 100, 200, 400, 800 and
1600.

Fig. 3. Probability of selecting the best estimator for Case 3. Panels:
splitting ratios 9:1 and 5:5. Coding: ■ = repeated learning-testing,
+ = repeated sampling with voting, × = single splitting. The sample sizes
are 100, 200, 400, 800 and 1600.

In summary, the simulation results are very much in line with the
understanding in Section 3.

5. Concluding remarks. We have shown that under some sensible conditions
on the $L_2$, $L_4$ and $L_\infty$ norms of $f - \hat f$ for the competing
estimators, with an appropriate splitting ratio of the data for cross
validation, the better model/procedure will be selected with probability
converging to 1. Unlike the previous results on CV that focus either on
comparing only parametric regression models or on selecting a smoothing
parameter in nonparametric regression, our result can be applied generally to
compare both parametric and nonparametric models/procedures.

The result also reveals some interesting behavior of CV. Differently from
the parametric model selection case, for comparing two models converging
at the same nonparametric rate, it is not necessary for the evaluation size
in data splitting to be dominating. Actually, the proportion of the
evaluation part can even be of a smaller order than the estimation proportion
without damaging the property of consistency in selection. An implication is
that it may be desirable to take the characteristics of the regression
estimators into consideration for data splitting, which to our knowledge has
not been seriously addressed in the literature. Based on our result, for
comparing two estimators with at least one being nonparametric, half-half
splitting is a good choice.

Delete-1 CV is a popular choice in applications. This is usually suitable
for estimating the regression function. However, when one's goal is to find
which method is more accurate for the data at hand, the proportion of
evaluation needs to be much larger. Further research on practical methods for
properly choosing the data splitting proportion can be valuable for
successful applications of cross validation.

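6. Proof of Theorem 1. Without loss of generality, assume that
$\hat f_{n,1}$ is the asymptotically better estimator by Condition 2. Note
that

\[
\begin{aligned}
\mathrm{CV}(\hat f_{n_1,j})
&= \sum_{i=n_1+1}^{n} \bigl(f(X_i) - \hat f_{n_1,j}(X_i) + \varepsilon_i\bigr)^2 \\
&= \sum_{i=n_1+1}^{n} \varepsilon_i^2
 + \sum_{i=n_1+1}^{n} \bigl(f(X_i) - \hat f_{n_1,j}(X_i)\bigr)^2
 + 2\sum_{i=n_1+1}^{n} \varepsilon_i \bigl(f(X_i) - \hat f_{n_1,j}(X_i)\bigr).
\end{aligned}
\]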

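Define for $j = 1, 2$,

\[
L_j = \sum_{i=n_1+1}^{n} \bigl(f(X_i) - \hat f_{n_1,j}(X_i)\bigr)^2
 + 2\sum_{i=n_1+1}^{n} \varepsilon_i \bigl(f(X_i) - \hat f_{n_1,j}(X_i)\bigr).
\]

Then $\mathrm{CV}(\hat f_{n_1,1}) \le \mathrm{CV}(\hat f_{n_1,2})$ is
equivalent to $L_1 \le L_2$ and thus also equivalent to

\[
2\sum_{i=n_1+1}^{n} \varepsilon_i \bigl(\hat f_{n_1,2}(X_i) - \hat f_{n_1,1}(X_i)\bigr)
\le \sum_{i=n_1+1}^{n} \bigl(f(X_i) - \hat f_{n_1,2}(X_i)\bigr)^2
 - \sum_{i=n_1+1}^{n} \bigl(f(X_i) - \hat f_{n_1,1}(X_i)\bigr)^2 .
\]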

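Conditional on $Z^1$ and $X^2 = (X_{n_1+1}, \ldots, X_n)$, assuming
$\sum_{i=n_1+1}^{n} (f(X_i) - \hat f_{n_1,2}(X_i))^2$ is larger than
$\sum_{i=n_1+1}^{n} (f(X_i) - \hat f_{n_1,1}(X_i))^2$, by Chebyshev's
inequality, we have

\[
P\bigl(\mathrm{CV}(\hat f_{n_1,1}) > \mathrm{CV}(\hat f_{n_1,2}) \bigm| Z^1, X^2\bigr)
\le \min\Biggl(1,\;
\frac{4\sigma^2 \sum_{i=n_1+1}^{n} \bigl(\hat f_{n_1,2}(X_i) - \hat f_{n_1,1}(X_i)\bigr)^2}
{\Bigl(\sum_{i=n_1+1}^{n} \bigl(f(X_i) - \hat f_{n_1,2}(X_i)\bigr)^2
 - \sum_{i=n_1+1}^{n} \bigl(f(X_i) - \hat f_{n_1,1}(X_i)\bigr)^2\Bigr)^2}\Biggr).
\]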
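The Chebyshev step can be unpacked as follows. This is a sketch in shorthand
introduced here, not the paper's notation: $d_i = \hat f_{n_1,2}(X_i) -
\hat f_{n_1,1}(X_i)$, and $\Delta_n$ is the difference of the two sums of
squares, which is positive under the stated assumption. It also assumes, as
the form of the bound suggests, that given $Z^1$ and $X^2$ the errors are
uncorrelated with mean zero and variance at most $\sigma^2$.

\[
P\bigl(\mathrm{CV}(\hat f_{n_1,1}) > \mathrm{CV}(\hat f_{n_1,2}) \bigm| Z^1, X^2\bigr)
= P\Bigl(2\sum_{i=n_1+1}^{n} \varepsilon_i d_i > \Delta_n \Bigm| Z^1, X^2\Bigr)
\le \frac{\mathrm{Var}\bigl(2\sum_{i=n_1+1}^{n} \varepsilon_i d_i \bigm| Z^1, X^2\bigr)}{\Delta_n^2}
\le \frac{4\sigma^2 \sum_{i=n_1+1}^{n} d_i^2}{\Delta_n^2}.
\]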

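Let $Q_n$ denote the ratio in the upper bound in the above inequality and let
$S_n$ be the event that $\sum_{i=n_1+1}^{n} (f(X_i) - \hat f_{n_1,2}(X_i))^2 >
\sum_{i=n_1+1}^{n} (f(X_i) - \hat f_{n_1,1}(X_i))^2$. It follows that

\[
\begin{aligned}
P\bigl(\mathrm{CV}(\hat f_{n_1,1}) > \mathrm{CV}(\hat f_{n_1,2})\bigr)
&= P\bigl(\{\mathrm{CV}(\hat f_{n_1,1}) > \mathrm{CV}(\hat f_{n_1,2})\} \cap S_n\bigr)
 + P\bigl(\{\mathrm{CV}(\hat f_{n_1,1}) > \mathrm{CV}(\hat f_{n_1,2})\} \cap S_n^c\bigr) \\
&\le E\bigl(P\bigl(\mathrm{CV}(\hat f_{n_1,1}) > \mathrm{CV}(\hat f_{n_1,2}) \bigm| Z^1, X^2\bigr) I_{S_n}\bigr) + P(S_n^c) \\
&\le E \min(1, Q_n) + P(S_n^c).
\end{aligned}
\]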

If we can show that $P(S_n^c) \to 0$ and $Q_n \to 0$ in probability as
$n \to \infty$, then due to the boundedness of $\min(1, Q_n)$ [which implies
that the random variables $\min(1, Q_n)$ are uniformly integrable], we have
that $P(\mathrm{CV}(\hat f_{n_1,1}) > \mathrm{CV}(\hat f_{n_1,2}))$ converges
to zero as $n \to \infty$. Suppose we can show that for every $\epsilon > 0$,
there exists $\alpha_\epsilon > 0$ such that when $n$ is large enough,

\[
P\Biggl(\frac{\sum_{i=n_1+1}^{n} \bigl(f(X_i) - \hat f_{n_1,2}(X_i)\bigr)^2}
{\sum_{i=n_1+1}^{n} \bigl(f(X_i) - \hat f_{n_1,1}(X_i)\bigr)^2} \ge 1 + \alpha_\epsilon\Biggr)
\ge 1 - \epsilon. \tag{6.1}
\]

Then $P(S_n) \ge 1 - \epsilon$ and thus $P(S_n^c) \to 0$ as $n \to \infty$.
By the triangle inequality,

\[
\sum_{i=n_1+1}^{n} \bigl(\hat f_{n_1,2}(X_i) - \hat f_{n_1,1}(X_i)\bigr)^2
\le 2\sum_{i=n_1+1}^{n} \bigl(f(X_i) - \hat f_{n_1,1}(X_i)\bigr)^2
 + 2\sum_{i=n_1+1}^{n} \bigl(f(X_i) - \hat f_{n_1,2}(X_i)\bigr)^2 .
\]

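Then with probability no less than $1 - \epsilon$, $Q_n$ is upper bounded by

\[
\frac{8\sigma^2 \Bigl(\sum_{i=n_1+1}^{n} \bigl(f(X_i) - \hat f_{n_1,1}(X_i)\bigr)^2
 + \sum_{i=n_1+1}^{n} \bigl(f(X_i) - \hat f_{n_1,2}(X_i)\bigr)^2\Bigr)}
{\Bigl(\bigl(1 - 1/(1+\alpha_\epsilon)\bigr) \sum_{i=n_1+1}^{n} \bigl(f(X_i) - \hat f_{n_1,2}(X_i)\bigr)^2\Bigr)^2}
\le \frac{8\sigma^2 \bigl(1 + 1/(1+\alpha_\epsilon)\bigr)}
{\bigl(1 - 1/(1+\alpha_\epsilon)\bigr)^2 \sum_{i=n_1+1}^{n} \bigl(f(X_i) - \hat f_{n_1,2}(X_i)\bigr)^2}. \tag{6.2}
\]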
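The algebra behind (6.2) is short. As a sketch, write $a_j =
\sum_{i=n_1+1}^{n} (f(X_i) - \hat f_{n_1,j}(X_i))^2$ for $j = 1, 2$ (a
shorthand introduced here). On the event in (6.1) we have
$a_1 \le a_2/(1+\alpha_\epsilon)$, and the triangle-inequality bound gives a
numerator of at most $8\sigma^2(a_1 + a_2)$, so

\[
Q_n \le \frac{8\sigma^2 (a_1 + a_2)}{(a_2 - a_1)^2}
\le \frac{8\sigma^2 (a_1 + a_2)}{\bigl(\bigl(1 - 1/(1+\alpha_\epsilon)\bigr) a_2\bigr)^2}
\le \frac{8\sigma^2 \bigl(1 + 1/(1+\alpha_\epsilon)\bigr) a_2}{\bigl(1 - 1/(1+\alpha_\epsilon)\bigr)^2 a_2^2}
= \frac{8\sigma^2 \bigl(1 + 1/(1+\alpha_\epsilon)\bigr)}{\bigl(1 - 1/(1+\alpha_\epsilon)\bigr)^2 a_2}.
\]

The bound therefore tends to zero whenever $a_2 \to \infty$.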



From (6.1) and (6.2), to show $P(S_n^c) \to 0$ and $Q_n \to 0$ in
probability, it suffices to show (6.1) and

\[
\sum_{i=n_1+1}^{n} \bigl(f(X_i) - \hat f_{n_1,2}(X_i)\bigr)^2 \to \infty
\quad \text{in probability}. \tag{6.3}
\]

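Suppose a slight relaxation of Condition 1 holds: for every $\epsilon > 0$,
there exists $A_{n_1,\epsilon}$ such that when $n_1$ is large enough,
$P(\|f - \hat f_{n_1,j}\|_\infty \ge A_{n_1,\epsilon}) \le \epsilon$ for
$j = 1, 2$. Let $H_{n_1}$ be the event $\{\max(\|f - \hat f_{n_1,1}\|_\infty,
\|f - \hat f_{n_1,2}\|_\infty) \le A_{n_1,\epsilon}\}$. Then on $H_{n_1}$,
$W_i = (f(X_i) - \hat f_{n_1,j}(X_i))^2 - \|f - \hat f_{n_1,j}\|_2^2$ is
bounded between $-(A_{n_1,\epsilon})^2$ and $(A_{n_1,\epsilon})^2$. Notice
that conditional on $Z^1$ and $H_{n_1}$,

\[
\mathrm{Var}_{Z^1}(W_{n_1+1})
\le E_{Z^1}\bigl(f(X_{n_1+1}) - \hat f_{n_1,j}(X_{n_1+1})\bigr)^4
= \|f - \hat f_{n_1,j}\|_4^4 ,
\]

where the subscript $Z^1$ in $\mathrm{Var}_{Z^1}$ and $E_{Z^1}$ is used to
denote the conditional variance and expectation given $Z^1$. Thus conditional
on $Z^1$, on $H_{n_1}$, by Bernstein's inequality (see, e.g., Pollard [18],
page 193), for each $x > 0$, we have

\[
P_{Z^1}\Biggl(\sum_{i=n_1+1}^{n} \bigl(f(X_i) - \hat f_{n_1,1}(X_i)\bigr)^2
 - n_2 \|f - \hat f_{n_1,1}\|_2^2 \ge x\Biggr)
\le \exp\Biggl(-\frac{1}{2}\,
\frac{x^2}{n_2 \|f - \hat f_{n_1,1}\|_4^4 + 2(A_{n_1,\epsilon})^2 x/3}\Biggr).
\]

Taking $x = \beta_n n_2 \|f - \hat f_{n_1,1}\|_2^2$, the above inequality
becomes

\[
P_{Z^1}\Biggl(\sum_{i=n_1+1}^{n} \bigl(f(X_i) - \hat f_{n_1,1}(X_i)\bigr)^2
 \ge (1 + \beta_n) n_2 \|f - \hat f_{n_1,1}\|_2^2\Biggr)
\le \exp\Biggl(-\frac{1}{2}\,
\frac{n_2 \beta_n^2 \|f - \hat f_{n_1,1}\|_2^4}
{\|f - \hat f_{n_1,1}\|_4^4 + \bigl(2(A_{n_1,\epsilon})^2 \beta_n/3\bigr) \|f - \hat f_{n_1,1}\|_2^2}\Biggr).
\]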
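The substitution can be checked directly. As a sketch, abbreviate
$a = \|f - \hat f_{n_1,1}\|_2^2$, $b = \|f - \hat f_{n_1,1}\|_4^4$ and
$A = A_{n_1,\epsilon}$ (shorthand introduced here). With $x = \beta_n n_2 a$,
the event $\{\sum (f(X_i) - \hat f_{n_1,1}(X_i))^2 - n_2 a \ge x\}$ is the
event $\{\sum (f(X_i) - \hat f_{n_1,1}(X_i))^2 \ge (1 + \beta_n) n_2 a\}$,
and the exponent becomes

\[
\frac{x^2}{n_2 b + 2A^2 x/3}
= \frac{\beta_n^2 n_2^2 a^2}{n_2 b + 2A^2 \beta_n n_2 a/3}
= \frac{n_2 \beta_n^2 a^2}{b + (2A^2 \beta_n/3)\, a}.
\]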



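Under Condition 2, for every $\epsilon > 0$, there exists
$\alpha'_\epsilon > 0$ such that when $n$ is large enough,
$P(\|f - \hat f_{n_1,2}\|_2^2 / \|f - \hat f_{n_1,1}\|_2^2 \le
1 + \alpha'_\epsilon) \le \epsilon$. Take $\beta_n$ such that
$1 + \beta_n = \|f - \hat f_{n_1,2}\|_2^2 / ((1 + \alpha'_\epsilon/2)
\|f - \hat f_{n_1,1}\|_2^2)$. Then with probability at least $1 - \epsilon$,
$\beta_n \ge (\alpha'_\epsilon/2)/(1 + \alpha'_\epsilon/2)$. Let $D_n$ denote
this event and let $S_1 = \sum_{i=n_1+1}^{n} (f(X_i) -
\hat f_{n_1,1}(X_i))^2$. Then on $D_n$ we have

\[
\beta_n \ge \frac{\alpha'_\epsilon \|f - \hat f_{n_1,2}\|_2^2}
{2(1 + \alpha'_\epsilon)(1 + \alpha'_\epsilon/2) \|f - \hat f_{n_1,1}\|_2^2},
\]

\[
\begin{aligned}
P_{Z^1}\bigl(S_1 \ge (1 + \beta_n) n_2 \|f - \hat f_{n_1,1}\|_2^2\bigr)
&= P_{Z^1}\Bigl(S_1 \ge \frac{n_2}{1 + \alpha'_\epsilon/2} \|f - \hat f_{n_1,2}\|_2^2\Bigr) \\
&\le P_{Z^1}\Biggl(S_1 \ge \Bigl(1 + \frac{\alpha'_\epsilon \|f - \hat f_{n_1,2}\|_2^2}
{2(1 + \alpha'_\epsilon)(1 + \alpha'_\epsilon/2) \|f - \hat f_{n_1,1}\|_2^2}\Bigr)
n_2 \|f - \hat f_{n_1,1}\|_2^2\Biggr) \\
&\le \exp\Biggl(-\frac{(\alpha'_\epsilon)^2}{8(1 + \alpha'_\epsilon)^2 (1 + \alpha'_\epsilon/2)^2}
\times \frac{n_2 \|f - \hat f_{n_1,2}\|_2^4}
{\|f - \hat f_{n_1,1}\|_4^4 + \frac{\alpha'_\epsilon (A_{n_1,\epsilon})^2}
{3(1 + \alpha'_\epsilon)(1 + \alpha'_\epsilon/2)} \|f - \hat f_{n_1,2}\|_2^2}\Biggr).
\end{aligned}
\]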

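If we have

\[
\frac{n_2 \|f - \hat f_{n_1,2}\|_2^4}{\|f - \hat f_{n_1,1}\|_4^4} \to \infty
\quad \text{in probability}, \tag{6.4}
\]

\[
\frac{n_2 \|f - \hat f_{n_1,2}\|_2^2}{(A_{n_1,\epsilon})^2} \to \infty
\quad \text{in probability}, \tag{6.5}
\]

then the upper bound in the last inequality above converges to zero in
probability. From these pieces, we can conclude that

\[
P\Biggl(\sum_{i=n_1+1}^{n} \bigl(f(X_i) - \hat f_{n_1,1}(X_i)\bigr)^2
 \ge \frac{n_2}{1 + \alpha'_\epsilon/2} \|f - \hat f_{n_1,2}\|_2^2\Biggr)
\le 3\epsilon + \Delta(\epsilon, n), \tag{6.6}
\]

for some $\Delta(\epsilon, n) \to 0$ as $n \to \infty$. Indeed, for every
given $\epsilon > 0$, when $n$ is large enough, writing $G_n$ for the event
$\{\frac{1}{n_2} \sum_{i=n_1+1}^{n} (f(X_i) - \hat f_{n_1,1}(X_i))^2 \ge
\frac{1}{1 + \alpha'_\epsilon/2} \|f - \hat f_{n_1,2}\|_2^2\}$,

\[
\begin{aligned}
P(G_n)
&\le P(H_{n_1}^c) + P(D_n^c) + P(H_{n_1} \cap D_n \cap G_n) \\
&\le 3\epsilon + E\, P\bigl(H_{n_1} \cap D_n \cap G_n \bigm| Z^1\bigr) \\
&\le 3\epsilon + E \exp\Biggl(-\frac{(\alpha'_\epsilon)^2}{8(1 + \alpha'_\epsilon)^2 (1 + \alpha'_\epsilon/2)^2}
\times \frac{n_2 \|f - \hat f_{n_1,2}\|_2^4}
{\|f - \hat f_{n_1,1}\|_4^4 + \frac{\alpha'_\epsilon (A_{n_1,\epsilon})^2}
{3(1 + \alpha'_\epsilon)(1 + \alpha'_\epsilon/2)} \|f - \hat f_{n_1,2}\|_2^2}\Biggr) \\
&= 3\epsilon + \Delta(\epsilon, n),
\end{aligned}
\]

where the expectation in the upper bound of the last inequality above [i.e.,
$\Delta(\epsilon, n)$] converges to zero due to the convergence in
probability to zero of the random variables of the exponential expression and
their uniform integrability (since they are bounded above by 1), provided
that (6.4) and (6.5) hold. The assertion of (6.6) then follows.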

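For the other estimator, similarly, for $0 < \beta_n < 1$ (here $\beta_n$
denotes a new sequence, not the random quantity defined above), we have

\[
P_{Z^1}\Biggl(\sum_{i=n_1+1}^{n} \bigl(f(X_i) - \hat f_{n_1,2}(X_i)\bigr)^2
 \le (1 - \beta_n) n_2 \|f - \hat f_{n_1,2}\|_2^2\Biggr)
\le \exp\Biggl(-\frac{1}{2}\,
\frac{n_2 \beta_n^2 \|f - \hat f_{n_1,2}\|_2^4}
{\|f - \hat f_{n_1,2}\|_4^4 + \bigl(2(A_{n_1,\epsilon})^2 \beta_n/3\bigr) \|f - \hat f_{n_1,2}\|_2^2}\Biggr).
\]

If we have

\[
\frac{n_2 \beta_n^2 \|f - \hat f_{n_1,2}\|_2^4}{\|f - \hat f_{n_1,2}\|_4^4} \to \infty
\quad \text{in probability}, \tag{6.7}
\]

\[
\frac{n_2 \beta_n \|f - \hat f_{n_1,2}\|_2^2}{(A_{n_1,\epsilon})^2} \to \infty
\quad \text{in probability}, \tag{6.8}
\]

then following a similar argument used for $\hat f_{n_1,1}$, we have

\[
P\Biggl(\sum_{i=n_1+1}^{n} \bigl(f(X_i) - \hat f_{n_1,2}(X_i)\bigr)^2
 \le (1 - \beta_n) n_2 \|f - \hat f_{n_1,2}\|_2^2\Biggr) \to 0. \tag{6.9}
\]

From this, if $n_2 \|f - \hat f_{n_1,2}\|_2^2 \to \infty$ in probability and
$\beta_n$ is bounded away from 1, then (6.3) holds. If in addition, we can
choose $\beta_n \to 0$, then for each given $\epsilon$, we have
$(1 - \beta_n) \|f - \hat f_{n_1,2}\|_2^2 >
\frac{1 + \alpha_\epsilon}{1 + \alpha'_\epsilon/2}
\|f - \hat f_{n_1,2}\|_2^2$ for some small $\alpha_\epsilon > 0$ when $n_1$
is large enough. Now for every $\tilde\epsilon > 0$, we can find
$\epsilon > 0$ such that $3\epsilon \le \tilde\epsilon/3$ and there exists an
integer $n_0$ such that when $n \ge n_0$ the probability in (6.9) is upper
bounded by $\tilde\epsilon/3$ and $\Delta(\epsilon, n) \le \tilde\epsilon/3$.
Consequently when $n \ge n_0$,

\[
P\Biggl(\frac{\sum_{i=n_1+1}^{n} \bigl(f(X_i) - \hat f_{n_1,2}(X_i)\bigr)^2}
{\sum_{i=n_1+1}^{n} \bigl(f(X_i) - \hat f_{n_1,1}(X_i)\bigr)^2} \ge 1 + \alpha_\epsilon\Biggr)
\ge 1 - \tilde\epsilon .
\]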

Recall that we needed the conditions (6.4), (6.5), (6.7) and (6.8) for (6.1)
to hold. Under Condition 3, $n_2 \|f - \hat f_{n_1,2}\|_2^4 /
\|f - \hat f_{n_1,1}\|_4^4$ is lower bounded in order in probability by
$n_2 \|f - \hat f_{n_1,2}\|_2^4 / (M_{n_1}^4 \|f - \hat f_{n_1,1}\|_2^4)$.
From all above, since $\hat f_{n_1,1}$ and $\hat f_{n_1,2}$ converge exactly
at rates $p_n$ and $q_n$, respectively, under the $L_2$ loss, we know that
for the conclusion of Theorem 1 to hold, it suffices to have these
requirements: for every $\epsilon > 0$, for some $\beta_n \to 0$, we have
that each of $n_2 \beta_n^2 M_{n_1}^{-4}$, $n_2 (q_{n_1}/p_{n_1})^4$,
$n_2 \beta_n q_{n_1}^2 (A_{n_1,\epsilon})^{-2}$,
$n_2 q_{n_1}^2 (A_{n_1,\epsilon})^{-2}$ and $n_2 q_{n_1}^2$ goes to infinity.

Under Condition 1, for every $\epsilon > 0$, there exists a constant
$B_\epsilon > 0$ such that $P(\|f - \hat f_{n_1,j}\|_\infty \ge
B_\epsilon A_{n_1}) \le \epsilon$ when $n_1$ is large enough. That is, for a
given $\epsilon > 0$, we can take $A_{n_1,\epsilon} = O(A_{n_1})$. Therefore
if we have $n_2 M_{n_1}^{-4} \to \infty$ and
$n_2 q_{n_1}^2 / (1 + A_{n_1})^2 \to \infty$, then we can find
$\beta_n \to 0$ such that the five requirements in the previous paragraph are
all satisfied. This completes the proof of Theorem 1.